A STATUS REPORT ON THE STUDY OF TEACHER EFFECTIVENESS*

David C. Berliner
Far West Laboratory for Educational Research and Development 
San Francisco, California 94103 

Advocates of performance or competency based teacher education,
state mandated evaluation programs, such as the Stull Bill in California,
and teacher accountability systems, all suffer to some degree from
ostrichism. Ostrichism is a common disease often afflicting education.
Its etiology is in a premature commitment to a particular educational
movement. Behavioral symptoms include the practice of sticking one's
head into the sand when problems appear, in the hope that the problems
will go away.

The particular educational movement which is inducing the current
epidemic of ostrichism is the commitment of educators to competency
training and evaluation without the existence of empirical evidence
linking teacher behavior to student outcomes in classroom settings.
The Coleman report (1966), and its offshoots (Jencks, 1972; Mosteller
and Moynihan, 1972), have minimized the role of the teacher in accounting
for educational outcomes. These investigators claim that family
background, socioeconomic status, ethnicity, and the like, are the major
causal variables affecting between school differences in achievement.

In that same tradition is the criticism of Heath and Nielson (1974).
Their review of the studies of teacher clarity, use of student ideas,
criticism, enthusiasm, and other variables commonly accepted as skills
or competencies, has revealed serious flaws in the extant research.
They concluded first that there is no established empirical relation
between teacher behavior and student achievement. Second, that the
flaws in the research are due to nonsensical statistical analyses,
weak research designs, and sterile operational definitions of teacher
behavior and student outcomes. And third, that because of the strong
association between omnibus measures of student achievement and
socioeconomic and ethnic status, the effects of teachers and techniques of
teaching on achievement are bound to be trivial.

These are serious criticisms of the effects of teaching on student
achievement. Yet unless replicable findings relating teaching behavior
to student achievement in natural classroom settings can be found, the
performance and competency based teacher education, evaluation, and
accountability programs will not be believable. Let us remember that
the heart of the performance and competency based approaches to teacher
education, teacher evaluation and teacher accountability has to be the

*The ideas presented in this paper have emerged from discussions with
the staff of the Beginning Teacher Evaluation Study of the Far West
Laboratory for Educational Research and Development. This is a project
of the California Commission on Teacher Preparation and Licensing,
funded by the National Institute of Education. The comments of Margaret
Bierly, Leonard Cahen, Nikki Filby, Charles Fisher and Marjorie Powell
are gratefully acknowledged.



empirically established relationship between teacher behavior as an
independent variable and student cognitive and affective outcomes as
dependent variables. Whether we are interested in effective science
teaching, as this group is, or effective mathematics or home economics
teaching, establishing empirical relationships between teacher behavior
and student outcomes has to be our goal.

Ferment exists because performance and competency based education,
in all its forms, has been sold before it really exists (cf. Shanker,
1974). Those who use research to criticize teachers, teaching, and
performance based teacher education, as well as those who defend teachers,
teaching and performance based approaches have all taken positions before
they have the necessary empirical backing. There is not now, and there
will not be for some time, any empirical evidence to take any firm position
on these issues. Extremely important problems hamper the study of
teachers and teaching in all subject matter areas. I believe it will
take years before these problems can even be understood well enough to
do classroom research properly. I think you should keep in mind that
the first step in the systematic study of any phenomenon is the
recognition of what problems exist in that research area. Addressing these
problems, rather than assuming they will go away, or that they do not
apply, will enhance the likelihood that studies of teacher effectiveness
will be fruitful. The problems, as I see them, are loosely grouped into
three categories concerned with the instrumentation, methodology and
statistics used in studying how teachers affect the achievement of
students.



INSTRUMENTATION PROBLEMS 



There are serious instrumentation problems connected with both the
independent and dependent variables commonly used in research on teacher
effectiveness. Six of those issues are discussed here.

Dependent Variable Problems

Our work at the Laboratory has been hampered by an inability to
satisfactorily resolve three problems connected with development of
dependent variables. These problems are connected with standardized
testing, tests of special teaching units, and development of multivariate
outcome measures.

Standardized testing. In studies of how teachers affect students,
standardized achievement tests are extensively used criteria or outcome
measures. These tests are, as a group, highly reliable instruments.
They usually have adequate curriculum content validity, and seem
predictive of future academic success. These tests have, however, one
overwhelming flaw. They simply may not reflect what was taught in any
one teacher's classroom. The tests are designed to be used in all kinds
of courses within a curriculum area, and therefore cannot be completely
sensitive or appropriate for any one teacher's teaching (Ball, 1972).
They simply lack content validity at the classroom level.

The standardized achievement tests are also highly correlated with
standardized intelligence tests, thus causing us to wonder exactly what
kinds of items are really used in these tests. Furthermore, the tests
are usually group administered multiple-choice tests. When working with
young, bilingual, or lower socioeconomic status children, there is a
serious question about whether many of the children are being
appropriately tested.

In our own work, when standardized tests must be used, we try to
refine the items in a number of ways. We try to choose items where
there is evidence of substantial change in difficulty level over some
instructional period. In this way we hope to identify items that are
reactive to instruction. We try to pick items that correlate weakly
with a measure of general intelligence, like the Raven's Progressive
Matrices test, rather than picking those items with higher saturations
of general intelligence. We try to have teachers rate items on how much
time it would take them to teach that idea, or how much emphasis they
put on material like that addressed by the item. Unless items on a
standardized test are put through a systematic screening of this type,
the test is not going to be particularly reactive to teaching.
Off-the-shelf standardized tests make poor dependent variables for studies of
teaching. This is part of the difficulty in interpreting the Coleman
report. The tests they used in that study were more reactive to family
background and ethnicity than they were to instructional events within
the school. It does not directly follow from this kind of evidence that
teachers have no effect on student achievement.
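
The first two item screens described above can be sketched in a few
lines of code. This is only an illustration under stated assumptions,
not the Laboratory's actual procedure: the function names, the 0/1
scoring of items, and the two cutoffs (min_gain and max_ability_r) are
hypothetical, and the third screen, teacher ratings of emphasis, is
left out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlational information
    return sxy / sqrt(sxx * syy)


def screen_items(pre, post, ability, min_gain=0.20, max_ability_r=0.30):
    """Return the indices of items that pass both screens.

    pre, post -- one row per student, each row a list of 0/1 item scores
    ability   -- one general ability score per student (e.g. Raven's)
    The cutoffs are illustrative assumptions, not values from the paper.
    """
    n_students = len(pre)
    kept = []
    for j in range(len(pre[0])):
        # Screen 1: proportion correct should rise over the instructional period.
        gain = (sum(row[j] for row in post) - sum(row[j] for row in pre)) / n_students
        # Screen 2: the item should correlate only weakly with general ability.
        r = pearson([row[j] for row in post], ability)
        if gain >= min_gain and abs(r) <= max_ability_r:
            kept.append(j)
    return kept
```

An item that shows no change in difficulty, or that tracks the ability
measure closely, is dropped; what remains is a shorter test that should
be more reactive to instruction.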

Tests for special teaching units. To insure the use of tests that
are content valid for a particular classroom, many investigators of
teaching have created special teaching units, or content vehicles, to
study teaching (Berliner and Ward, 1974; Joyce, 1975; Popham, 1971). An
experimental unit of this type contains curricula materials, objectives,
and sample test items. The teacher is asked to teach to the objectives.
The unit could be a single 30-minute lesson, or require daily work over
three weeks. Under these conditions every teacher has similar materials
and objectives to work with. Students are pre and post tested with
carefully constructed tests designed to tap many dimensions of the
material in the experimental teaching unit. The dependent variable in
this situation is much more valid and much more reactive to classroom
teaching. In comparative studies of teaching effectiveness, these
experimental teaching units, and their tests, have much to commend them.
Each teacher has a similar chance to try to produce gains in student
achievement. Some teachers will be better at this than others.


Unfortunately, at this time in our research efforts, we do not
know if the measures of teaching effectiveness arrived at over a short
period of time provide an estimate of teacher effectiveness over a longer
period of time. This methodology, which is used in our research on
teaching, allows us to use tests of high content validity that seem to
accurately reflect classroom practice for a short period of time. But
this methodology may not have any predictive validity. We do not know
if the ranking of teachers on effectiveness, as determined by the
relationships between student pre and post test scores associated with an
experimental teaching unit, is at all correlated with a ranking of those
teachers over the whole school year. We will have information on this
issue later in the year. Frankly, we do not now expect a measure of
teaching effectiveness obtained over a short period of time to correlate
very highly with a measure of teaching effectiveness for an entire
school year. Thus studying teacher effectiveness with dependent
measures tied to special teaching units may not, in our estimation of the
state of the art, be a fair characterization of teaching over the long
haul. Predictive validity with such materials appears to be too low.

Multivariate outcomes. There are at least two dependent variables
in any instructional interaction that should be of interest to us. One
of these is the achievement of the learner in the situation. This has
been a commonly used measure of instructional outcomes. The other, less
often examined, is the learner's feelings about the instructional
situation. We do not always ask students questions which probe their liking
for their teacher or the subject matter. We overlook inquiring about
their enjoyment of their classmates, the degree of threat felt in the
class, and whether or not they would take more courses in that area.
When such issues are addressed in research studies, the affective set
of dependent measures is kept separate from the achievement measures.

Our problem in the research we do is to find ways to use
multivariate outcomes so that many kinds of achievement and affective
responses are used as indicators of the quality of classroom life for
a child. I think the problem is something like the difficulties in
teaching reading. You can get high comprehension at slow reading rates.
Or you can get low comprehension at high rates of reading. But it is
obvious that there must be some optimum multivariate outcome that
simultaneously considers both reading comprehension and speed. The
same kind of multivariate outcome measures, simultaneously considering
both achievement and affective outcomes, is needed for research on
teaching. If we do not consider what is learned and what is felt about that
learning simultaneously, we stand to fractionate school learning into
pieces that do not resemble the students' view of reality.

Independent Variable Problems 


Our work has also been hampered by problems connected with the
independent variables used in studies of teacher effectiveness. A
major difficulty we have encountered is related to the issue of
appropriateness of teacher behavior. A second issue is related to the
determination of a unit of analysis for the independent variable. A third
issue is concerned with the stability of teacher behavior.

Appropriateness of teaching behavior. My colleagues and I have
spent a good deal of time counting teacher behaviors. We know something
about the number of higher and lower cognitive questions asked per unit
time, we have counted the rate of positive verbal praise, the number of
criticisms made, the number of probes, the frequency of explaining links,
etc. For many of these variables we have found a low correlation with
some student outcome measures. But in our classroom observations we
have become acutely aware of the difference between a higher cognitive
question asked after a train of thought is running out, and the same
type of question asked after a series of lower cognitive questions has
been used to establish a foundation from which to explore higher-order
ideas. We have seen teachers ask inane questions. We have seen teachers
direct questions to what we believe was the wrong child. We have seen
positive verbal reinforcement used with a new child in the class, one
who was trying to win peer group acceptance, and whose behavior the
teacher chose to use as a standard of excellence. We watched silently
as the class rejected the intruder, while the teacher's count in the
verbal praise category went up and up and up. We have seen teachers
respond to student initiated questions with irrelevant information. We
have seen teachers achieve a high rate of probing student responses to
questions, seemingly without regard for the student or the kind of
initial response given to a question. Some students were embarrassed by
the probing, with other students probes occurred at inappropriate times,
and sometimes probes were not used when the situation seemed to cry out
for them. Similarly, we observed skillful probing where a student's
knowledge about an issue was brought out and shared with the class,
after a weak first response was given by that student. The teacher's
questioning was as skillful as Plato's, but we had recorded only its
frequency.

All these events have led us to reassess our strong behavioristic
stance in the study of teaching. We still regard frequency counts as
very useful information. But we now feel quite strongly that the
qualitative dimension, dealing with value judgements about appropriate
use of skills, must enter into our observations of teaching. We must
address the appropriateness issue in order to study the information
processing and decision making skills of human teachers. It is precisely
these skills that provide the most important rationale for having human
teachers in the classroom.

The unit of analysis for the independent variable. Something else
we have become acutely aware of in our studies of teacher effectiveness
is the problem of the unit of analysis for characterizing the independent
variable. Is the single teacher question the unit of interest? Is the
question, along with the wait-time, the unit? Or is the teacher question,
wait-time, and student answer the unit which best characterizes the
independent variable? And if the latter is most appropriate, does that
transaction become part of an episode or strategy of even more complex
dimensions and longer duration? Teachers follow strategies of
questioning and of discussion. In an inductive lesson the meaningful unit of
analysis may be a one-hour or one-week episode that is concerned with
the conservation of matter. The individual questions, reinforcers,
probes and student responses may be trivial aspects of the overall
episode. We certainly need to think about new conceptions for the units
underlying independent variables used in studies of teacher
effectiveness.

Something else about the nature of an instructional episode has
perplexed us. We have found very little data describing the nature of
the instructional activities and episodes a child engages in each day.
Since instructional time appears to be an important variable in the
learning process (Wiley, 1973; Harnischfeger and Wiley, 1975) we need
to obtain accurate records of how time has been allocated to the
various instructional activities and episodes we might identify. The work
of Gump (1967) and the techniques of Barker (1968) are useful starting
points for obtaining this kind of information. This perspective yields
accurate descriptions of the time a child spends in various activities
and the time he is exposed to instructional episodes of various types.
These activities and episodes can be treated as independent variables
and may be causally related to various types of student outcomes.

Stability of teacher behavior. Before an observer enters a
classroom to code teacher behavior in any sensible way, he has to be sure of
two things. First, that the frequency of the events he is trying to
observe is high enough so that at least one instance of the event will
occur during the observation period. Second, the behavior to be coded
should represent the teacher's usual and customary way of behaving. Only
if these conditions are met can a teacher's behavior be sensibly
characterized by the frequency count or rating scale description obtained
in observation of classroom activities. These basic requirements for
observation must be examined closely.


Many studies relating teacher behavior to student outcome have
examined teacher behavior that did not occur frequently. For example,
among 32 primary-grade science teachers the use of questions calling
for identifying relationships, hypothesizing, and testing hypotheses is
an extremely rare event on any given occasion of observation (cf. Moon,
1969; 1971). Another case of low frequency events, in an important area
of teaching, has to do with the management skills of teachers. We find
that in some communities classroom management is not too difficult. The
students are motivated and parents exert tight behavioral control, so
that traumatic disturbances are quite infrequent. In other communities
serious management problems exist all day long. So we find that to
observe instances of teacher behavior in the area of classroom management,
we must remember to take into account ecological factors. Furthermore,
we have learned that even in settings where management problems usually
occur with high frequency, certain teachers are so quick to establish a
non-disruptive social system that, by the time the observer enters the
class, particular kinds of events have been precluded from occurring.



How then can one study teacher behavior when important variables in
the study rarely occur? One answer, of course, is in denser observation
than is customary. Five one-hour observations of teacher behavior,
which is unusually high for most studies of teaching, may simply not
provide all the information an investigator may want. In addition, part
of the answer is in knowing when and where to observe. For example,
the first two weeks of schooling would be important for a study of
management skills in inner city schools. Simply trying for denser
observation, later in the year, in other types of schools, might be
wasted.


The problem of estimating behavioral stability is partly related
to the problem of the frequency of occurrence of behavior. When the
frequency of a behavior is low the correlations between the frequency
of occurrence for certain events, over occasions (that is, a coefficient
of stability for the behavior), will be low. But part of the problem in
looking at stability of teacher behavior is quite distinct from the
frequency issue. Think for a moment about the characteristics you prize
in a teacher. Usually, people think of "good" teachers as flexible.
Such teachers are expected to change methods, techniques, and styles to
suit particular students, curriculum areas, time of day or year, etc.
That is, the standard of excellence in teaching that we hold implies
a teacher whose behavior is inherently unstable. Needless to say there
is a problem for an observer who is trying to measure a teacher's
customary and usual ways of teaching.



For our study of teaching we have reviewed teacher stability, over
occasions, for a great many variables (Shavelson and Dempsey, 1975).
The results are fascinating. On the laughable side are coefficients of
stability from Campbell's (1972) analysis of science teaching at the
junior high school level, over two occasions. The Flanders Interaction
Analysis System was used, and the stability coefficient, that is, the
correlation between a teacher's standing on a measure across two
occasions, was, for a measure of indirectness in teaching (i/d ratio), -.90.
On five occasions Moon (1969; 1971) studied 32 primary grade science
teachers trained in the Science Curriculum Improvement Study (SCIS).
The stability coefficient for the Flanders indirectness measure went all
the way up to +.18; for the frequency of fact or recall questions, the
stability coefficient was -.12; and for amount of teacher talk, only
+.12. In Borg's (1972) study, the behavioral stability of teachers was
measured after training in questioning techniques had taken place. The
stability of the ratio of higher-order to fact questions was .07. The
rather large number of low and even negative stability coefficients
which exist in the literature confirms our belief that the independent
variables we often work with in studies of teacher effectiveness are not
fair indicators of a teacher's typical behavior. We are so eager to
capture variables for data analysis with our rating scales and frequency
counts, that we seem to have forgotten to check if our methodology is
appropriate to the phenomena we are interested in studying!
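
For readers who want the stability coefficient spelled out, it is
simply the correlation, across teachers, between a measure taken on one
occasion and the same measure taken on another. The sketch below
assumes a Pearson product-moment correlation, which the studies cited
may or may not have used; the function name and the counts for the five
teachers are hypothetical and are not data from any of those studies.

```python
from math import sqrt


def stability_coefficient(occasion_1, occasion_2):
    """Correlation between teachers' scores on two observation occasions.

    Each list holds one score per teacher, in the same teacher order.
    """
    if len(occasion_1) != len(occasion_2) or len(occasion_1) < 2:
        raise ValueError("need the same teachers (at least two) on both occasions")
    n = len(occasion_1)
    mean_1 = sum(occasion_1) / n
    mean_2 = sum(occasion_2) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(occasion_1, occasion_2))
    ss_1 = sum((a - mean_1) ** 2 for a in occasion_1)
    ss_2 = sum((b - mean_2) ** 2 for b in occasion_2)
    if ss_1 == 0 or ss_2 == 0:
        raise ValueError("no variation among teachers on one occasion")
    return cross / sqrt(ss_1 * ss_2)


# Hypothetical counts of higher-order questions for five teachers.
first_visit = [4, 9, 2, 7, 5]
second_visit = [6, 3, 5, 4, 8]
print(round(stability_coefficient(first_visit, second_visit), 2))
```

A coefficient near +1 means the teachers keep their rank order from one
visit to the next; a coefficient near zero or below means a single
visit tells us little about a teacher's customary behavior.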

Of course there are many exceptions to the trend for teacher
behavior to be unstable. We have found ratings of variables over 10 occasions
that yield high stability coefficients. These include stability
coefficients of .92 for teacher warmth; .79 for teacher enthusiasm; and .83
for teacher sensitivity (Wallen, 1969). We have found frequency counts
demonstrating that a global variable composed of all types of
reinforcement is reasonably stable over occasions, yielding a stability
coefficient of .64 (Trinchero, 1974). In the latter study, however, we find
considerable evidence pointing to the lack of generalizability of
stability coefficients across different teacher populations, curricula areas
and student populations. For example, the stability coefficient over
two occasions for the frequency of positive verbal teacher behavior was
.04 for English teachers, and .57 for social studies teachers.

By examining the stability of teachers' behavior, which is used as
the independent variable in studies of teacher effectiveness, we
conclude that: 1) some teacher behaviors that we think are important to
study occur infrequently. To study them requires extensive observation
in particular settings at appropriate times; 2) some teacher behaviors
that we think are important to study are basically unstable over
occasions. No practical amount of observations will result in a reliable
estimate of a teacher's use of these behaviors. Perhaps we need to
develop measures of variance instead of measures of central tendency to
describe those behaviors; 3) some teacher behaviors are stable over
occasions. In general, but not always, ratings or high inference
variables, rather than frequency counts or low inference variables, are the
more stable; 4) stability coefficients for many teacher behaviors will
not demonstrate ecological or population validity. Teacher behavior is
moderated, as it should be, by the kinds of students and the variety of
settings that teachers work in. Until we know more about which teacher
behaviors fluctuate, and how and why they fluctuate over time, settings,
curricula, and populations, studies relating teacher behavior to student
outcomes must remain primitive.



METHODOLOGICAL PROBLEMS 

A loosely related set of issues has been grouped under the title
problems in methodology. Each of the problems and issues mentioned is
in some way hampering the development of reliable knowledge about the
relationship between teacher behavior and student outcomes.

Student Background and Teacher Effectiveness 


One problem in studying the teaching process is estimating how much
can legitimately be expected of teachers or schools as an influence on
student growth. This problem is debated in educational philosophy,
sociology and economics, as well as educational psychology. And this
issue has already been mentioned when we discussed how procedures are
needed to reduce the influence of intelligence and ethnicity on test
performance in studies of teacher effectiveness. But the problem is
even more pervasive. Can a teacher be held accountable if a perfectly
appropriate prescription is given, and then not followed by students?
Suppose a teacher says, "read this chapter and come to my office so we
can discuss it." Among sub-cultures that see schools as hostile or
useless, students will not read the chapter and will not come in to discuss
it. Classes of such students may show minimum growth in achievement at
the end of the year. And these low achieving classes may very well be
made up of lower socioeconomic status children and ethnic minorities.
Under these conditions, how much responsibility is to be placed on
teachers for the low student performance?

On the other hand, with high intelligence, high socioeconomic
children, growth in achievement takes place almost in spite of the
teachers and teaching. Can the achievement of students in those settings
be attributable to teachers, or is it a product of genetic and
environmental advantage, relatively unaffected by what teachers do?

Since some children, often whole groups of children, may be
unwilling to learn in the institutions we now use to educate them, and some
children learn in those institutions regardless of what happens to them,
how do we go about attributing student achievement to what teachers do?
In the case of low achieving students we feel we may have to evaluate
teachers against some other criteria than student achievement, yet to
do so denies that teachers can and should make a difference in the
achievement of lower socioeconomic and minority children. I have no
solutions to this problem. I only know it exists and must be thought
about as people naively discuss teacher effectiveness without
qualifying what they say by noting the students' background characteristics,
particularly socioeconomic status and intelligence.



The Subject Matter and Teacher Effectiveness



That student background characteristics influence test performance
and almost all other aspects of schooling is well established. What was
not so well understood, until recently, is that student performance in
different curriculum areas is differentially affected by those
background characteristics. In the International Education Association's
(IEA) cross-cultural study of student achievement (Postlethwaite, 1973),
the variance accounted for by student background characteristics, such
as intelligence and social class, was estimated for a number of subject
matter areas. Clearly highlighted, around the world, was that home
influences on subjects like reading and social studies are very powerful.
Those influences are so powerful, in terms of their accounting for
student achievement, that there may not be enough variance unaccounted
for in the performance of students to attribute to the influence of
teachers.

But in other curriculum areas, student background accounts for
much less variance. Physics, chemistry, French, Spanish, geometry, and
trigonometry are not typically learned at home, and therefore the schools
account for more variance in these measures of achievement than for
achievement measures in reading, social studies or language arts. This
does not mean that socioeconomic status and intelligence are not related
to performance in science, foreign language or mathematics. It simply
means that the influence of those background factors is much less, thus
leaving more variance to potentially attribute to school and teacher
effects.

If we want to study teaching we should study it in those areas 
where we are most likely to be able to attribute an effect to teachers, 
after the influences of test unreliability and home background have 
been removed. Instead we typically study teaching in those subject 
areas where we are hardest pressed to causally relate teaching behavior 
to student outcomes. New approaches are called for. 
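
The arithmetic behind this argument can be made explicit with a rough
variance budget. The sketch below is my own simplification, not a
procedure from the IEA study: it assumes total score variance is
scaled to 1, that the share due to measurement error is one minus the
test's reliability, and that the share due to home background is the
squared multiple correlation of background with observed scores. The
function name and both pairs of figures are hypothetical.

```python
def variance_left_for_schooling(background_r2, reliability):
    """Upper bound on the share of score variance open to school effects.

    background_r2 -- proportion of observed variance tied to home background
    reliability   -- test reliability (1 - reliability is error variance)
    """
    if not (0.0 <= background_r2 <= 1.0 and 0.0 <= reliability <= 1.0):
        raise ValueError("both arguments must be proportions between 0 and 1")
    # Remove error variance and background variance from a total of 1.
    return max(0.0, reliability - background_r2)


# Hypothetical figures, chosen only to illustrate the contrast.
print(variance_left_for_schooling(background_r2=0.60, reliability=0.90))  # a home-learned subject
print(variance_left_for_schooling(background_r2=0.20, reliability=0.90))  # a school-learned subject
```

Whatever the true figures are, the remainder is a ceiling on what could
be attributed to teachers, since schools, curricula, and peers compete
for the same share.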

Normative Standards and Volunteer Samples in the Study of Teacher
Effectiveness

Our own work, and that of many of my colleagues, is, in simplest
form, a comparison of the post-instruction test scores of classes that
had similar pre-instruction test scores. These comparative differences
in outcomes are believed to discriminate between more and less effective
teachers. Our research approach is entirely normative. And in a norm
referenced research study some teachers will always appear to be better
than others. In fact, the whole sample of teachers in any study may
be quite poor when judged against some absolute standards, and we would
never know.

More likely, since studies of teacher effectiveness in natural
environments require the informed consent of volunteer teachers, we are
likely to do research with a sample of self-confident, relatively open
teachers, almost all of whom may be superior to a non-volunteer sample
on an unknown number of unidentified dimensions. But in a norm
referenced system, where teachers are evaluated against other teachers, we
will judge some of our sample to be less effective than others. This
is a silly research strategy, but one we cannot easily change. To bring
about change in this approach we would need to impose criterion
referenced achievement standards for teachers, and require all teachers to
participate in research of the type we are talking about. Until we can
do that, and I doubt we ever will, we should never talk of effective and
noneffective teachers. We are, at best, dealing with more and less
effective teachers, which is quite different from the absolute criteria
implied by the terms effective and noneffective. And because our norm
referenced research is done with volunteer samples, our statements about
teacher effectiveness should also include some reference to the fact
that these are more or less effective teachers from a sample of teachers
that are themselves probably superior to the average teacher in an unknown
number of ways.

Individual Differences Among Students and Teacher Effectiveness

All teachers know that some of the things they do will not be 
effective with some of the children they teach. There is no feeling of 
failure when this occurs; that's just the way things are. Most teachers 
recognize this problem and modify instruction accordingly. They customize 
their behavior, as best they can, to fit the individual styles 
of students. Our research on teacher effectiveness, however, usually 
ignores this phenomenon. We rarely collect enough individual difference 
measures on students to find out if particular teaching behaviors are 
differentially effective with different types of children. For example, 
from what we know about how aptitudes and treatments interact (cf. 
Berliner and Cahen, 1973), we can expect that a highly structured course 
in science, taught by a well organized, somewhat dominant teacher, will 
yield greater achievement for high anxious students than for low anxious 
students. On the other hand, the low anxious student will probably perform 
better than the high anxious student in the class of a science 
teacher providing only small amounts of guidance and using an inductive 
approach. In research on teacher effectiveness we ordinarily find no 
relation to student achievement outcomes for teacher behaviors that 
help to define constructs like inductive or deductive teaching style. 
Relationships may not appear because we don't know how to partition 
students into meaningful subgroups for whom the two different treatments 
might be uniquely applicable. If we could have divided students into 
high and low anxious individuals, to follow our example, we might have 
found that teacher behaviors within each teaching style had important 
effects on student achievement. 
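
The masking described above can be made concrete with a small sketch. 
The scores below are invented for illustration only and are not data 
from any study; the function names are likewise hypothetical. The sketch 
assumes equal numbers of students in each cell and shows how a teaching 
style can have no overall relation to achievement while having strong, 
opposite effects within the high and low anxious subgroups. 

```python
from statistics import mean

# Hypothetical achievement scores (illustrative only, not from any study).
# Keys are (teaching_style, student_anxiety).
scores = {
    ("structured", "high"): [78, 82, 80],
    ("structured", "low"): [68, 72, 70],
    ("inductive", "high"): [66, 70, 68],
    ("inductive", "low"): [80, 84, 82],
}


def cell_mean(style, anxiety):
    """Mean achievement for one style-by-anxiety cell."""
    return mean(scores[(style, anxiety)])


def style_mean(style):
    """Mean achievement for a style, pooling over anxiety levels."""
    pooled = [s for (st, _), vals in scores.items() if st == style for s in vals]
    return mean(pooled)


def main_effect_of_style():
    """Structured minus inductive, ignoring student anxiety."""
    return style_mean("structured") - style_mean("inductive")


def simple_effect_of_style(anxiety):
    """Structured minus inductive within one anxiety subgroup."""
    return cell_mean("structured", anxiety) - cell_mean("inductive", anxiety)


def interaction_contrast():
    """Difference between the two simple effects (the interaction)."""
    return simple_effect_of_style("high") - simple_effect_of_style("low")
```

With these invented numbers the pooled comparison of styles shows no 
difference at all, while the within-subgroup comparisons differ in 
sign. An analysis that stopped at the pooled comparison would wrongly 
conclude that teaching style is unrelated to achievement. 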

I have no doubt that the styles of teaching and teaching behavior 
recommended by, say, the curriculum guides accompanying new science 
curriculum projects are appropriate recommendations for some teachers, 
when interacting with some students. But not all students! By not 
focusing on the individual aptitudes, styles, personality, and traits 
of the student, we mask the effects of teachers, thus making it almost 
impossible to establish empirical relations between teaching behavior 
and student outcome. 

An equally important reason to use the aptitude-treatment interaction 
approach is to find teacher behaviors that in general have 
positive relationships with student outcomes, but are, in fact, negatively 
affecting the performance of small numbers of students. Research 
on teacher effectiveness has to begin searching for interactions as it 
continues trying to establish more general links between teacher behavior 
and student outcomes. 

Mediation of Teacher Effectiveness Through the Student's Behavior 

Another aspect of classroom reality that must be brought into our 
designs for research on teaching skills and competencies is the fact 
that teacher behavior does not influence student achievement directly. 
That is, a teacher's indirectness, or questioning, or reinforcement does 
not simply result in greater mathematics, reading, or science achievement. 
The link that must be understood is the behavior of the student 
in the instructional setting. We are now convinced that the mediating 
link so necessary to consider is a student's active time-on-task. If 
teacher questions, reinforcement, warmth, and clarity are to affect outcomes, 
they can only do so by engaging and then keeping the student's 
attention. If the student will attend, the possibility of learning 
exists. We need to look at teacher behaviors that affect student active 
learning. To do so means putting much more effort into clinical studies. 
In this way an investigator can work one-to-one with students, trying 
to understand how the student allocates his attention, and how nominal 
stimuli emitted by the teacher become effective stimuli for that student. 
To think that there is a direct link between, say, a teacher's questions 
which require the generation of hypotheses by students, and the students' 
achievement on a science test is overly simple. Intermediate links in 
[illegible] require us to examine the student's attending and 
information processing behavior. 




Another aspect of the student that must be thought about for research 
in teaching is the student's perspective on the events that impinge upon 
him in classrooms. We do not know how much of what we call skilled 
teaching is even perceived by the learner. From the learner's perspective, 
perhaps "analysis" and "synthesis" level questions are not 
distinguishable. Students may differentiate only "memory" and "thinking" 
questions. From the learner's perspective the rate of reinforcement 
may be irrelevant. The teacher either is "nice" or "not nice" to students. 
I believe that some variables thought to be quite important by 
educational theorists are [illegible] unimportant, unperceived or 
unperceivable by students (cf. Winne, 1974). Students exposed to variables 
they cannot perceive, or to variables they believe to be unimportant, 
may be unaffected by those variables. We certainly need to follow Snow's 
(1974) advice to researchers that urges more detailed accounts of what 
learners do in response to experimental treatments. 

Construct Validation and Teacher Effectiveness 

Through the writings of the logical positivists, and particularly 
the physicist Bridgman, social scientists became aware of the critical 
nature of language and operations in science. An initial development 
to further scientific understanding of some phenomena is a descriptive 
language that uses concepts having common meaning among the scientists 
working in the same area. The intensive and extensive meaning of key 
concepts needs to be shared by the members of the scientific community. 
The less the overlap of shared meaning, the less rigor the science can 
develop. A case in point would be a term like "withitness" from the 
study of teaching by Jacob Kounin (1970). The teacher who can spot 
trouble before it begins has "withitness." Such a teacher can be working 
with one group of students and call out a student's name at the 
other end of the room because he is beginning to cause a disturbance. 
That is "withitness." I recently went into a classroom and one of the 
concepts that helped me organize what I saw was the concept of 
"withitness." I felt perfectly at home using the concept. It helped me make 
sense out of the different styles of two teachers I was observing. Yet 
the concept itself cannot be rigorously defined and relies upon very 
subjective interpretation of phenomena. The construct of "withitness," 
like many of the concepts we work with, is useful, but inadequately 
defined. 

One way to increase the preciseness of our concepts is to tie them 
through clear operations to the measurement of their occurrence. For 
example, we can take a concept like teacher warmth, and define it as 
the number of times per day the teacher smiles. But is that what we 
want to measure when we measure warmth? It seems that the phenomenon we 
are interested in is fragmented beyond recognition when we use the 
occurrence of some molecular behavior to operationally define our terms. 

What we need to do in the study of teaching is to begin incorporating 
multiple methods of measurement into the studies we do (Campbell 
and Fiske, 1959). If we want to work with the concept of "withitness" 
or "warmth," we need to measure the concept from as many different 
perspectives as we can. For example, we should measure a teacher's warmth 
by self-report, student report, observer rating, frequency count of 
smiles, percent of gestures regarded as affectionate, and anything else 
we can think of. Then, from the intercorrelations of the various 
imprecise and imperfect measures of warmth, we can begin to understand 
the construct we so glibly use, but cannot clearly define. Extensive 
construct validation must take place or the impreciseness of our language 
for describing the phenomena we are interested in will keep the empirical 
study of teaching at its present primitive level. 
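
As a minimal sketch of what examining such intercorrelations might look 
like, the following computes Pearson correlations among several 
measures of warmth taken on the same teachers. All of the numbers are 
invented for illustration, the measure names are hypothetical labels, 
and a full multitrait-multimethod analysis in the sense of Campbell and 
Fiske would also require measures of other traits, which this sketch 
omits. 

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(measures):
    """Correlate every measure with every other; returns a dict of dicts."""
    names = list(measures)
    return {a: {b: pearson(measures[a], measures[b]) for b in names} for a in names}


# Hypothetical warmth scores for five teachers (illustrative only).
measures = {
    "self_report": [4, 4, 5, 3, 5],
    "student_report": [3, 4, 4, 2, 5],
    "observer_rating": [2, 3, 4, 5, 6],
    "smile_count": [10, 15, 20, 25, 30],
}

matrix = correlation_matrix(measures)

for name, row in matrix.items():
    # Print each row of the intercorrelation table to two decimals.
    print(name, [round(r, 2) for r in row.values()])
```

Measures that converge on the same construct should correlate 
substantially with one another; a measure that correlates with none of 
the others is probably tapping something else. 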

The Generalizability of Measures of Effectiveness 

If we are going to try to characterize teachers as more or less 
effective, in order to see if the behavior of those teachers differs, we 
need to know if the teachers themselves maintain their rank ordering on 
measures of effectiveness over time and over subject matter areas. As 
part of our research, we reviewed studies that addressed this problem. 
There are about eight studies of teacher effectiveness over lengthy 
periods of time (see Shavelson and Dempsey, 1975). The mean of these 
correlations between teacher effectiveness measured two or more times 
is about .20. This is based on data from predominantly primary age 
children tested with standardized reading and mathematics achievement 
tests. Brophy's (1973) study presents some interesting data to consider. 
Residual gain scores over 3 years were examined for 165 elementary 
teachers. Twenty-eight percent of the teachers were consistent in their 
effects on students three years in a row. Approximately 14 percent of 
the teachers in the study were consistently effective in producing higher 
than predicted reading and math achievement. And 14 percent of the 
teachers were consistent in being associated with classes that had 
scores lower than predicted in reading and mathematics three years in 
a row. Thirteen percent of the teachers showed linear increases in 
residual gains over the three years. That is, they appeared to be 
getting more effective in their teaching. Similarly, 11 percent of 
the teachers showed a linear decrease over that time period. They 
seemed to be getting less effective over time. Forty-nine percent of 
the teachers in this sample were inconsistent in the patterning of their 
residual scores over time. 

In our review of short term studies of teacher effectiveness, ranging 
across grade levels and all kinds of curriculum areas, we find that 
when the same content is taught to similar students (for example, teaching 
and reteaching an ecology lesson to two samples of urban students), 
moderately stable estimates of teacher effectiveness are obtained. But 
when different content is taught to two or more groups of similar 
students, the effectiveness measures were not found to be stable. Similarly, 
when different content is taught to the same students, estimates 
of effectiveness from occasion to occasion are unstable. Our own 
research, just completed, involved about 200 elementary school teachers, 
each of whom taught a two-week, specially designed teaching unit in 
reading and mathematics. Residual gain scores for each subject matter 
were calculated. These measures of effectiveness using different content 
and the same students were correlated. From these data we find 
that measures of effectiveness in the two curriculum areas correlate 
about .30. 
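
For readers unfamiliar with residual gain scores, the sketch below 
shows one common way to compute them. It assumes class means as the unit 
of analysis and a simple linear regression of post-test on pre-test; it 
is not a description of the exact procedure used in the studies cited. 
All numbers and names are hypothetical. A residual gain is the observed 
post-test score minus the score predicted from the pre-test, and the 
stability of effectiveness across subject matters is then the 
correlation between the two sets of residuals. 

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_gains(pre, post):
    """Observed post-test minus post-test predicted from pre-test."""
    intercept, slope = fit_line(pre, post)
    return [p - (intercept + slope * q) for q, p in zip(pre, post)]


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical class means for six teachers (illustrative only).
reading_pre = [40, 45, 50, 55, 60, 65]
reading_post = [52, 50, 61, 60, 70, 69]
math_pre = [38, 44, 49, 56, 61, 66]
math_post = [45, 55, 52, 66, 64, 75]

reading_resid = residual_gains(reading_pre, reading_post)
math_resid = residual_gains(math_pre, math_post)

# Correlation of effectiveness estimates across the two subject matters.
stability = pearson(reading_resid, math_resid)
print(round(stability, 2))
```

By construction the residuals are uncorrelated with the pre-test, which 
is what makes them attractive as an effectiveness measure when classes 
were not randomly assigned. 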



It appears that teachers do not, by and large, remain in a stable 
ordering on measures of teacher effectiveness. If, as we have discussed, 
the independent variables we typically look at are often unstable, and 
measures of teacher effectiveness also show instability, the possibility 
of correlating teacher behavior with student achievement to 
determine effective teaching behavior is quite limited. In fact, unless 
we reconceptualize much of what we do in this research area, it is 
ludicrous! 



STATISTICAL PROBLEMS* 

We have examined instrumentation and methodological problems, and 
turn now to a brief discussion of the statistical problems associated 
with the study of teacher effectiveness. The strategy we use in our 
research is to identify groups of teachers that differ in effectiveness 
and then to analyze the teaching behavior of the teachers in the 
contrasting groups. Our choice of statistical techniques is limited to 
those that apply when a single achievement test is administered to 
students prior to, and following, some teaching; and the teaching is 
considered an intervention that takes place with students who were not 
randomly assigned to classes. Under these conditions a statistical 
method is required to discriminate between groups of teachers that 
differ significantly in average pupil gain. The basic problem is one 
addressed over and over in educational research. How do you measure 
change without a true experimental design? 

*Robert W. Heath and Richard Marliave performed the analyses that 
addressed the problems discussed in this section of the paper. 



We have examined the whole range of statistical techniques based on 
regression approaches. We looked at the advantages and disadvantages of 
residualized raw scores, residualized true scores, curvilinear adjustments 
and methods that correct for non-homoscedastic bivariate distributions. 
We have also examined ways to define effectiveness based 
simply on post test raw score differences, for classes that had similar 
pre test scores. And we find much to recommend this simplest of methods, 
which avoids all pretense of sophisticated statistics. We have also 
found interesting possibilities in the new scaling methods, which avoid 
many of the assumptions of classical test theory. Groups of teachers 
that maximally differ from each other can be identified with these 
techniques, providing samples of more and less effective teachers within 
curriculum areas. 
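
The simplest of these methods can be sketched as follows. This is an 
illustration under stated assumptions rather than the procedure 
actually used: classes are grouped into bands of similar pre-test means 
(the band width of 10 points is an arbitrary choice), and each class's 
post-test mean is compared with the average post-test mean of its band. 
The data and all names are hypothetical. 

```python
from collections import defaultdict

# Hypothetical class means: (teacher, pre-test mean, post-test mean).
classes = [
    ("A", 41, 55), ("B", 43, 63), ("C", 44, 59),
    ("D", 61, 70), ("E", 62, 78), ("F", 64, 74),
]


def band(pre, width=10):
    """Assign a pre-test band; the width is an arbitrary illustrative choice."""
    return pre // width


def relative_post_scores(rows, width=10):
    """Post-test mean of each class minus the mean of its pre-test band."""
    by_band = defaultdict(list)
    for teacher, pre, post in rows:
        by_band[band(pre, width)].append((teacher, post))
    result = {}
    for members in by_band.values():
        band_mean = sum(post for _, post in members) / len(members)
        for teacher, post in members:
            result[teacher] = post - band_mean
    return result


def contrast_groups(rows, width=10):
    """Split teachers into those above and below their band's post-test mean."""
    rel = relative_post_scores(rows, width)
    more = sorted(t for t, d in rel.items() if d > 0)
    less = sorted(t for t, d in rel.items() if d < 0)
    return more, less


more_effective, less_effective = contrast_groups(classes)
print(more_effective, less_effective)
```

Note that the comparison remains entirely normative: a teacher is 
"more effective" only relative to the other classes in the same band. 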



CONCLUSION 

I stated above that the heart of performance and competency based 
teacher education, evaluation and accountability programs is the 
establishment of empirical relationships between teacher behavior as 
an independent variable and student achievement as a dependent variable. 
But before we can adequately establish those relationships we need to 
deal with the problems of instrumentation, methodology and statistics. 
We must come to grips with the inadequacy of standardized tests, the 
unknown predictive validity of tests from special teaching units, the 
problem of building multivariate outcome measures, the problems of 
measurement of appropriateness of teacher behavior, the lack of experience 
in choosing an appropriate unit of analysis for describing teaching 
behavior, and the lack of stability of many teacher behaviors. 

We need time to consider the problems of how student background 
affects measures of teacher effectiveness, what subject matters should 
be examined, how normative standards and volunteer teachers affect what 
we can say about teachers and teaching, how individual students react 
to teaching skills, how students monitor and interpret a teacher's 
behavior in ways which may or may not coincide with how educational 
theorists interpret the phenomena, and we need time and resources to 
do construct validation and studies of the generalizability of measures 
of teacher effectiveness. 

Finally, we need guidance on what techniques to use for measuring 
changes in the achievement of students in natural classrooms. 

When we have finished examining this potpourri of problems, issues, 
and concerns, we will be ready to begin the scientific study of teaching. 
And if we cannot deal with all of these problems, perhaps we 
should simply acknowledge that teaching is, after all, a very complex 
set of events which cannot be easily understood. 






REFERENCES 



Barker, R. G. (1968) Ecological Psychology: Concepts and Methods for 
Studying the Environment of Human Behavior. Stanford, California: 
Stanford University Press. 

Berliner, D. C. and L. S. Cahen (1973) Trait-Treatment Interactions 
and Learning. In F. N. Kerlinger (Ed.), Review of Research in 
Education, 1. Itasca, Illinois: F. E. Peacock Publishers. 

Berliner, D. C. and B. A. Ward (1974) Proposal for Phase III Beginning 
Teacher Evaluation Study. San Francisco, California: Far West 
Laboratory for Educational Research and Development. 

Borg, W. R. (1972) The Minicourse as a Vehicle for Changing Teacher 
Behavior: A Three-Year Follow-up. Journal of Educational 
Psychology, 63, 572-579. 

Brophy, J. E. (1973) Stability of Teacher Effectiveness. American 
Educational Research Journal, 10, 245-252. 

Campbell, D. T. and D. W. Fiske (1959) Convergent and Discriminant 
Validation by the Multitrait-Multimethod Matrix. Psychological 
Bulletin, 56, 81-105. 

Campbell, J. R. (1972) A Longitudinal Study in the Stability of 
Teachers' Verbal Behavior. Science Education, 56, 89-96. 

Coleman, J. S., et al. (1966) Equality of Educational Opportunity. 
Washington, D.C.: U.S. Government Printing Office. 

Gall, M. D. (1973) The Problems of "Student Achievement" in Research 
on Teacher Effects. San Francisco, California: Far West 
Laboratory for Educational Research and Development (Report A73-2). 

Gump, P. V. (1967) The Classroom Behavior Setting: Its Nature and 
Relation to Student Behavior. U.S. Office of Education, Department 
of Health, Education and Welfare. Final Report, Project No. 
5-0334, Contract No. OE-4-10-107. Lawrence, Kansas: University 
of Kansas (mimeo). 

Harnischfeger, A. and D. E. Wiley (1975) Teaching-Learning Processes 
in Elementary School: A Synoptic View. Beginning Teacher 
Evaluation Study, Technical Report No. 75-3-1. San Francisco, 
California: Far West Laboratory for Educational Research and 
Development (mimeo). 

Heath, R. W. and M. A. Nielson (1974) Performance-Based Teacher 
Education. Review of Educational Research, 44, 463-484. 

Jenks, C., et al. (1972) Inequality: A Reassessment of Family and 
Schooling in America. New York: Basic Books. 






Joyce, B. R. (1975) Vehicles for Controlling Content in the Study of 
Teaching. Paper given at the meeting of the American Educational 
Research Association, Washington, D.C., April, 1975. 

Kounin, J. S. (1970) Discipline and Group Management in Classrooms. 
New York: Holt, Rinehart and Winston. 

Moon, T. C. (1969) Study of Verbal Behavior Patterns in Primary 
Grade Classrooms During Science Activities. Unpublished doctoral 
dissertation, Michigan State University. 

Moon, T. C. (1971) A Study of Verbal Behavior Patterns in Primary 
Grade Classrooms During Science Activities. Journal of Research 
in Science Teaching, 8, 171-177. 

Mosteller, F. and D. P. Moynihan (1972) On Equality of Educational 
Opportunity. New York: Vintage Books. 

Popham, W. J. (1971) Performance Tests of Teaching Proficiency: 
Rationale, Development, and Validation. American Educational 
Research Journal, 8, 105-117. 

Postlethwaite, T. N. (1973) A Selection From the Overall Findings of 
the IEA Study in Science, Reading Comprehension, Literature, French 
as a Foreign Language, English as a Foreign Language, and Civic 
Education. Paris, France: International Institute for Educational 
Planning (mimeo report no. IIEP/STU/MISC/73.3 [Rev. 1]). 

Shanker, A. (1974) Competency-Based Teacher Training and Certification: 
Acceptable and Unacceptable Models. QUEST Consortium Yearbook. 
Washington, D.C.: American Federation of Teachers. 

Shavelson, R. S. and N. Dempsey (1975) Generalizability of Measures 
of Teacher Effectiveness and Teaching Process. San Francisco, 
California: Beginning Teacher Evaluation Study, Technical Report 
[number illegible], Far West Laboratory for Educational Research and 
Development. 

Snow, R. E. (1974) Designs for Research on Teaching. Review of 
Educational Research, 44, 265-291. 

Trinchero, R. L. (1974) Three Technical Skills of Teaching: Their 
Stability and Effect on Pupil Attitudes and Achievement. Unpublished 
doctoral dissertation, Stanford University. 

Wallen, N. E. (1969) Sausalito Teacher Education Project. Annual 
Report. San Francisco, California: Division of Compensatory 
Education, Bureau of Professional Development, San Francisco 
State College (mimeo). 

Wiley, D. E. (1973) Another Hour, Another Day: Quantity of Schooling, 
A Potent Path for Policy. Studies of Educative Processes, No. 3. 
Chicago, Illinois: University of Chicago, July 1973. 

Winne, P. H. (1974) Teacher Effectiveness and Student Perceptions of 
Teacher Cues. Dissertation proposal, Stanford School of Education 
(mimeo). 



